There are several programs for quality assessment, but FastQC is the most popular one.
FastQC is a user-friendly program to assess the quality of the reads generated by any of the
sequencing technologies, and it produces a report that summarizes the results in graphs
that are easy to interpret. Potential quality problems include low-quality bases, adaptor
sequences attached to the reads, adaptor dimers or other technical contaminating sequences,
and overrepresented PCR sequences. The report also covers sequence length distribution,
per base sequence content, per sequence GC content, per base N content, and k-mer content.
Per base sequence quality and adaptor content are the most important metrics to inspect
and act on. Ideal sequencing data raise no warnings and fail no metrics, so we should fix
as many problems as possible. However, some problems may not be solvable. If an unsolved
problem does not affect the reads severely, the data can still be used in the analysis, but
we must be aware that unsolved problems may have a negative impact on the results. Read
quality problems are addressed according to the failed metrics by removing low-quality reads,
trimming bases from the beginning and the end of the reads, and masking bases
with low quality scores. There are several programs for processing raw sequence
data. FASTX-toolkit is the most popular one for single-end FASTQ files, and Trimmomatic
is more sophisticated and can be used for both single-end and paired-end raw data. Fastp
filters low-quality reads and automatically recognizes and trims adaptor sequences. It is
important to process paired-end FASTQ files (forward and reverse) together to avoid
leaving singletons, that is, reads whose mate has been discarded, which most aligners may
not accept. In this chapter, we discussed command-line programs for quality control. The
same or similar programs are also implemented in Python, R, and other programming
languages, and the general principles for checking raw data quality and solving potential
quality problems are the same. Most sequencing applications use these kinds of QC processing,
but when we cover metagenomic data analysis, you will learn how to preprocess microbial
raw data using different programs. Once the raw sequencing data are cleaned, we can
safely move to the next step of sequence data analysis, depending on the application
workflow that we are adopting.
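
The clean-up operations summarized above (trimming read ends, masking low-quality bases, removing low-quality reads, and processing both mates of a pair together) can be illustrated with a minimal Python sketch. It is not how FASTX-toolkit, Trimmomatic, or fastp are implemented. The function names, the Q20 threshold, the minimum read length of 4, and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def phred_scores(qual):
    """Decode a FASTQ quality string into a list of per-base Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def trim_ends(seq, qual, min_q=20):
    """Trim bases scoring below min_q from the beginning and the end of a read."""
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def mask_low_quality(seq, qual, min_q=20):
    """Replace every base scoring below min_q with N."""
    return "".join(
        base if q >= min_q else "N"
        for base, q in zip(seq, phred_scores(qual))
    )


def passes_filter(qual, min_mean_q=20, min_len=4):
    """Keep a read only if it is long enough and its mean quality is adequate."""
    # The length check runs first, so mean() is never called on an empty read.
    return len(qual) >= min_len and mean(phred_scores(qual)) >= min_mean_q


def clean_pair(fwd, rev, min_q=20):
    """Trim both mates; return the pair only if both survive, else None.

    Dropping the whole pair keeps the forward and reverse files in sync,
    so no singleton is left behind.
    """
    f_seq, f_qual = trim_ends(fwd[0], fwd[1], min_q)
    r_seq, r_qual = trim_ends(rev[0], rev[1], min_q)
    if passes_filter(f_qual) and passes_filter(r_qual):
        return (f_seq, f_qual), (r_seq, r_qual)
    return None


if __name__ == "__main__":
    # '#' encodes Q2 and 'I' encodes Q40 under Phred+33.
    seq, qual = trim_ends("ACGTACGT", "##IIII#I")
    print(seq, qual)                    # GTACGT IIII#I
    print(mask_low_quality(seq, qual))  # GTACNT
    # The reverse mate is entirely low quality, so the whole pair is dropped.
    print(clean_pair(("ACGTACGT", "IIIIIIII"), ("ACGTACGT", "########")))  # None
```

Here each read is a (sequence, quality) tuple. A real workflow would parse the FASTQ records first and would also remove adaptor sequences, which this sketch does not attempt.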